
The Cox Proportional Hazards Model and Its Characteristics


The Formula for the Cox PH Model

The Cox PH model is usually written in terms of the hazard model formula shown below. This model gives an expression for the hazard at time $t$ for an individual with a given specification of a set of explanatory variables denoted by the bold $\mathbf{X}$. That is, the bold $\mathbf{X}$ represents a collection (sometimes called a "vector") of predictor variables that is being modeled to predict an individual's hazard.

$$h(t, \mathbf{X}) = h_0(t)\,e^{\sum_{i=1}^{p}\beta_i X_i}$$

$$\mathbf{X}=(X_1,X_2,\ldots,X_p)$$

where $X_1$ through $X_p$ are explanatory/predictor variables

The Cox model formula says that the hazard at time $t$ is the product of two quantities. The first of these, $h_0(t)$, is called the baseline hazard function. The second quantity is the exponential expression $e$ to the linear sum of $\beta_i X_i$, where the sum is over the $p$ explanatory $X$ variables.
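The product structure can be sketched in a few lines of Python. This is only an illustration: the function names, the baseline hazard `example_baseline`, and the coefficient values are hypothetical and are not taken from any fitted model. In particular, the Cox model itself leaves $h_0(t)$ unspecified, so a concrete baseline is supplied here purely so that the code can run.

```python
import math

def cox_hazard(t, x, beta, baseline_hazard):
    """Evaluate h(t, X) = h0(t) * exp(sum_i beta_i * X_i)."""
    # Linear sum of beta_i * X_i over the p explanatory variables
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    # First factor depends only on t, second factor only on the X's
    return baseline_hazard(t) * math.exp(linear_predictor)

def example_baseline(t):
    """Hypothetical baseline hazard, assumed here only for illustration."""
    return 0.1 * t

beta = [0.5, -0.2]  # illustrative coefficients, not estimates from data

# Hazard at t = 2 for an individual with X = (1, 3)
h = cox_hazard(2.0, [1.0, 3.0], beta, example_baseline)

# With every X equal to zero the exponential factor is e^0 = 1,
# so only the baseline factor remains
h_zero = cox_hazard(2.0, [0.0, 0.0], beta, example_baseline)
```

Passing the baseline as a function keeps the two factors visibly separate: one argument carries everything that depends on $t$, the other everything that depends on the $X$'s.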

An important feature of this formula, which concerns the proportional hazards (PH) assumption, is that the baseline hazard is a function of $t$ but does not involve the $X$'s. In contrast, the exponential expression involves the $X$'s but does not involve $t$. The $X$'s here are called time-independent $X$'s.

$X$'s involving $t$: time-dependent

Requires extended Cox model (no PH)

It is possible, nevertheless, to consider $X$'s which do involve $t$. Such $X$'s are called time-dependent variables. If time-dependent variables are considered, the Cox model form may still be used, but such a model no longer satisfies the PH assumption, and is called the extended Cox model.

A time-independent variable is defined to be any variable whose value for a given individual does not change over time. Examples are SEX and smoking status (SMK). Note, however, that a person's smoking status may actually change over time, but for purposes of the analysis, the SMK variable is assumed not to change once it is measured, so that only one value per individual is used.

Definition

Time-independent variable: Values for a given individual do not change over time; e.g., SEX and SMK

Also note that although variables like AGE and weight (WGT) change over time, it may be appropriate to treat such variables as time-independent in the analysis if their values do not change much over time or if the effect of such variables on survival risk depends essentially on the value at only one measurement.

The Cox model formula has the property that if all the $X$'s are equal to zero, the formula reduces to the baseline hazard function. That is, the exponential part of the formula becomes $e$ to the zero, which is 1. This property of the Cox model is the reason why $h_0(t)$ is called the baseline function.

Or, from a slightly different perspective, the Cox model reduces to the baseline hazard when no $X$'s are in the model. Thus, $h_0(t)$ may be considered as a starting or "baseline" version of the hazard function, prior to considering any of the $X$'s.

Another important property of the Cox model is that the baseline hazard, $h_0(t)$, is an unspecified function. It is this property that makes the Cox model a semiparametric model.

In contrast, a parametric model is one whose functional form is completely specified, except for the values of the unknown parameters. For example, the Weibull hazard model is a parametric model and has the form $h(t,\mathbf{X}) = \lambda p t^{p-1} e^{\sum_i \beta_i X_i}$, where the unknown parameters are $\lambda$, $p$, and the $\beta_i$'s. Note that for the Weibull model, $h_0(t)$ is given by $\lambda p t^{p-1}$.
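To make the contrast concrete, here is a minimal sketch of a Weibull hazard in which the baseline is fully specified as $\lambda p t^{p-1}$. The function name and all numeric values are hypothetical and serve only as an illustration.

```python
import math

def weibull_ph_hazard(t, x, lam, p, beta):
    """Weibull hazard: lam * p * t**(p - 1) * exp(sum_i beta_i * X_i)."""
    # Parametric baseline: its shape is known once lam and p are given
    baseline = lam * p * t ** (p - 1)
    return baseline * math.exp(sum(b * xi for b, xi in zip(beta, x)))

# Illustrative parameter values (assumptions, not estimates from data)
lam, p, beta = 0.05, 1.5, [0.7]

hazard_exposed = weibull_ph_hazard(4.0, [1], lam, p, beta)
hazard_unexposed = weibull_ph_hazard(4.0, [0], lam, p, beta)
```

Unlike the Cox model, nothing about the time dependence is left open here: choosing $\lambda$ and $p$ determines the baseline hazard at every $t$.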


ML Estimation of the Cox PH Model

We now describe how estimates are obtained for the parameters of the Cox model. The parameters are the $\beta$'s in the general Cox model formula. The corresponding estimates of these parameters are called maximum likelihood (ML) estimates and are denoted as $\hat{\beta}_i$.

As an example of ML estimates, we consider once again the computer output for one of the models (Model 2) fitted previously from remission data on 42 leukemia patients. The Cox model for this example involves two parameters, one being the coefficient of the treatment variable (denoted here as $Rx$) and the other being the coefficient of the $\log \mathrm{WBC}$ variable. The fitted model contains the estimated coefficients 1.294 for $Rx$ and 1.604 for log white blood cell count.

As with logistic regression, the ML estimates of the Cox model parameters are derived by maximizing a likelihood function, usually denoted as $L$. The likelihood function is a mathematical expression which describes the joint probability of obtaining the data actually observed on the subjects in the study as a function of the unknown parameters (the $\beta$'s) in the model being considered. $L$ is sometimes written notationally as $L(\beta)$, where $\beta$ denotes the collection of unknown parameters.

The expression for the likelihood is developed at the end of the chapter. However, we give a brief overview below.

The formula for the Cox model likelihood function is actually called a "partial" likelihood function rather than a (complete) likelihood function. The term "partial" likelihood is used because the likelihood formula considers probabilities only for those subjects who fail, and does not explicitly consider probabilities for those subjects who are censored. Thus the likelihood for the Cox model does not consider probabilities for all subjects, and so it is called a "partial" likelihood.
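The idea can be sketched in code using the standard form of the Cox partial likelihood, under the simplifying assumption that no two subjects fail at the same time. The function name and the four-subject toy data set below are hypothetical; they are not the remission data. Each subject who fails contributes one term, while a censored subject appears only in the risk sets of earlier failures.

```python
import math

def cox_partial_log_likelihood(beta, times, events, covariates):
    """Log partial likelihood, assuming no tied failure times.

    times: observed follow-up time per subject
    events: 1 if the subject failed, 0 if censored
    covariates: one list of X values per subject
    """
    def linear_predictor(x):
        return sum(b * xi for b, xi in zip(beta, x))

    log_lik = 0.0
    for j, (t_j, failed) in enumerate(zip(times, events)):
        if failed != 1:
            continue  # censored subjects contribute no term of their own
        # Risk set: everyone still under observation at time t_j
        risk_sum = sum(
            math.exp(linear_predictor(x))
            for t, x in zip(times, covariates)
            if t >= t_j
        )
        log_lik += linear_predictor(covariates[j]) - math.log(risk_sum)
    return log_lik

# Hypothetical toy data: the second subject is censored at time 3
times = [2, 3, 5, 7]
events = [1, 0, 1, 1]
covariates = [[1], [0], [1], [0]]

ll_at_zero = cox_partial_log_likelihood([0.0], times, events, covariates)
```

ML estimation then amounts to searching for the $\beta$ values that make this quantity as large as possible. Statistical packages do this numerically and also handle tied failure times, which this sketch does not.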


Computing the Hazard Ratio

In general, a hazard ratio (HR) is defined as the hazard for one individual divided by the hazard for a different individual. The two individuals being compared can be distinguished by their values for the set of predictors, that is, the $X$'s. We can write the hazard ratio as the estimate of $h(t,\mathbf{X}^*)$ divided by the estimate of $h(t,\mathbf{X})$, where $\mathbf{X}^*$ denotes the set of predictors for one individual, and $\mathbf{X}$ denotes the set of predictors for the other individual.

Note that, as with an odds ratio, it is easier to interpret an HR that exceeds the null value of 1 than an HR that is less than 1. Thus, the $X$'s are typically coded so that the group with the larger hazard corresponds to $\mathbf{X}^*$, and the group with the smaller hazard corresponds to $\mathbf{X}$. As an example, for the remission data described previously, the placebo group is coded as $X_1^*=1$, and the treatment group is coded as $X_1=0$.

We now obtain an expression for the HR formula in terms of the regression coefficients by substituting the Cox model formula into the numerator and denominator of the hazard ratio expression. Notice that the only difference between the numerator and the denominator is the $X^*$'s versus the $X$'s. Notice also that the baseline hazards will cancel out.
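Written out, the substitution described above takes the following form. This is a reconstruction from the Cox model formula given earlier, with $\widehat{HR}$ used here as notation for the estimated hazard ratio:

$$\widehat{HR} = \frac{\hat{h}(t,\mathbf{X}^*)}{\hat{h}(t,\mathbf{X})} = \frac{\hat{h}_0(t)\,e^{\sum_{i=1}^{p}\hat{\beta}_i X_i^*}}{\hat{h}_0(t)\,e^{\sum_{i=1}^{p}\hat{\beta}_i X_i}} = e^{\sum_{i=1}^{p}\hat{\beta}_i\,(X_i^* - X_i)}$$

Because the baseline hazard appears in both numerator and denominator, the final expression involves neither $h_0(t)$ nor $t$.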

Example 1

Suppose, for example, there is only one $X$ variable of interest, $X_1$, which denotes (0,1) exposure status, so that $p = 1$. Then, the hazard ratio comparing exposed to unexposed persons is obtained by letting $X_1^* = 1$ and $X_1 = 0$ in the hazard ratio formula. The estimated hazard ratio then becomes $e$ to the quantity $\hat{\beta}_1$ times 1 minus 0, that is, $e^{\hat{\beta}_1(1-0)}$, which simplifies to $e^{\hat{\beta}_1}$.

Recall the remission data printout for Model 1, which contains only the $Rx$ variable. The estimated hazard ratio is obtained by exponentiating the coefficient 1.509, which gives the value 4.523 shown in the HR column of the output.
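The arithmetic is easy to check directly. The variable names below are illustrative; the only input is the coefficient 1.509 quoted above.

```python
import math

beta_rx_hat = 1.509  # coefficient of Rx quoted in the text for Model 1

# Hazard ratio for placebo (Rx = 1) versus treatment (Rx = 0)
hazard_ratio = math.exp(beta_rx_hat * (1 - 0))
print(round(hazard_ratio, 3))
```

The result agrees with the printed HR of 4.523 to two decimal places; any remaining difference in the third decimal is presumably due to the coefficient being rounded in the printout.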

Example 2

We now give a second example which illustrates how to compute a hazard ratio when the model does contain product terms. We consider the printout for Model 3 of the remission data shown here.

To obtain the hazard ratio for the effect of $Rx$ adjusted for $\log \mathrm{WBC}$ using Model 3, we consider $\mathbf{X}^*$ and $\mathbf{X}$ vectors which have three components, one for each variable in the model. The $\mathbf{X}^*$ vector, which denotes a placebo subject, has components $X_1^*=1$, $X_2^*=\log \mathrm{WBC}$, and $X_3^*=1 \times \log \mathrm{WBC}$. The $\mathbf{X}$ vector, which denotes a treated subject, has components $X_1=0$, $X_2=\log \mathrm{WBC}$, and $X_3=0 \times \log \mathrm{WBC}$. Note again that, as with the previous example, the value for $\log \mathrm{WBC}$ is treated as fixed, though unspecified. Using the general formula for the hazard ratio, we must now compute the exponential of the sum of three quantities, corresponding to the three variables in the model. Substituting the values from the printout and the values of the vectors $\mathbf{X}^*$ and $\mathbf{X}$ into this formula and simplifying, we obtain the exponential of 2.355 minus 0.342 times $\log \mathrm{WBC}$, that is, $e^{2.355 - 0.342\,\log \mathrm{WBC}}$.
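Because of the product term, the hazard ratio for $Rx$ is no longer a single number but a function of $\log \mathrm{WBC}$. The sketch below evaluates it using the two values that survive the simplification, read here as the coefficient of $Rx$ (2.355) and the coefficient of the product term (-0.342). The $\log \mathrm{WBC}$ main-effect term is omitted because it is the same for both subjects and cancels. The function name and the $\log \mathrm{WBC}$ values 2 and 4 are chosen only for illustration.

```python
import math

BETA_RX = 2.355           # coefficient of Rx, as quoted in the text
BETA_RX_LOGWBC = -0.342   # coefficient of the Rx x log WBC product term

def hazard_ratio_rx(log_wbc):
    """HR for placebo (Rx = 1) versus treatment (Rx = 0) at a fixed log WBC."""
    # Difference X* - X for the Rx component and the product component
    rx_diff = 1 - 0
    product_diff = 1 * log_wbc - 0 * log_wbc
    return math.exp(BETA_RX * rx_diff + BETA_RX_LOGWBC * product_diff)

# Illustrative log WBC values
for log_wbc in (2.0, 4.0):
    print(log_wbc, round(hazard_ratio_rx(log_wbc), 2))
```

Since the product-term coefficient is negative, the estimated hazard ratio shrinks as $\log \mathrm{WBC}$ increases.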